LLM training and tuning
See:
Resources
- Maxime Labonne - A Beginner’s Guide to LLM Fine-Tuning
- Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem | PyTorch
- Fine-Tuning - LlamaIndex
- Maxime Labonne - Fine-tune Mistral-7b with Direct Preference Optimization
- Maxime Labonne - Fine-tune Llama 3 with ORPO
- Maxime Labonne - Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth
- Efficient Fine-tuning with PEFT and LoRA | Niklas Heidloff
- From Zero to PPO: Understanding the Path to Helpful AI Models
- Pre-training
- Goal: Train a large language model (LLM) on vast amounts of text data to predict the next token.
- Objective: Minimize cross-entropy loss.
- Outcome: A model with broad general knowledge but limited alignment to human intent (e.g., helpfulness, honesty).
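A minimal sketch of the pre-training objective above: next-token prediction with cross-entropy loss. The model name (`gpt2`) is only a stand-in for any Hugging Face causal LM; passing `labels=` to the model would compute the same shifted loss internally.

```python
# Next-token prediction with cross-entropy, computed explicitly for clarity.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models are trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

# logits: (batch, seq_len, vocab_size)
logits = model(**inputs).logits

# Shift so that position t predicts token t+1, then average cross-entropy.
shift_logits = logits[:, :-1, :]
shift_labels = inputs["input_ids"][:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(f"next-token cross-entropy: {loss.item():.3f}")
```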
- Supervised Fine-Tuning (SFT)
- Purpose: Align the model with human-preferred conversational behaviors.
- Process:
- Fine-tune the pre-trained model on a curated dataset of high-quality, human-written prompt-response pairs.
- Limitation: Constrained by dataset size and quality.
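A minimal SFT sketch on a single toy prompt-response pair, assuming a small causal LM (`gpt2` as a stand-in) and a simplified prompt/response boundary. Loss is computed only on response tokens, which is the usual way to fine-tune on curated instruction data.

```python
# Supervised fine-tuning step: mask prompt tokens with -100 so only the
# response contributes to the loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "### Instruction:\nExplain fine-tuning in one sentence.\n### Response:\n"
response = "Fine-tuning adapts a pre-trained model to a task using labeled examples."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response + tokenizer.eos_token, return_tensors="pt").input_ids

# Mask the prompt portion (boundary handling simplified for the sketch).
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss: {loss.item():.3f}")
```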
- Rejection Sampling
- Purpose: Enhance response quality using human feedback.
- Process:
- Generate multiple responses per prompt.
- Human annotators rank or select the best response(s).
- Use the selected responses to further fine-tune the model.
- Limitation: Time- and resource-intensive.
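A rejection-sampling sketch: sample several candidates for the same prompt and keep the best one for another fine-tuning round. The `score` function here is a placeholder for the human ranking step (or a reward model, see the next section); `gpt2` and the sampling parameters are illustrative.

```python
# Sample N candidate completions, keep the highest-scoring one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain why the sky is blue:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    max_new_tokens=40,
    num_return_sequences=4,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [
    tokenizer.decode(o[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    for o in outputs
]

def score(response: str) -> float:
    # Placeholder: in practice a human ranking or a trained reward model.
    return float(len(response.split()))

best = max(candidates, key=score)
print("Selected response:", best)
# `best` would then be added to the fine-tuning dataset for another SFT round.
```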
- Reward Modeling
- Objective: Automate response evaluation to scale human feedback.
- Process:
- Train a reward model to predict human preferences.
- Input: Pairs of responses with human rankings.
- Output: Scalar reward values representing response quality.
- Use the reward model to score new responses.
- Advantage: Scalable and reduces dependency on manual evaluation.
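A reward-model sketch: a sequence classifier with a single scalar head trained on (chosen, rejected) pairs with the pairwise ranking loss -log σ(r_chosen − r_rejected). The base model (`distilbert-base-uncased`) and the example pair are stand-ins.

```python
# Pairwise (Bradley-Terry style) reward-model training step.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1  # one scalar reward per sequence
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

prompt = "How do I learn Python?"
chosen = prompt + " Start with the official tutorial and practice small projects daily."
rejected = prompt + " Just figure it out."

chosen_ids = tokenizer(chosen, return_tensors="pt", truncation=True)
rejected_ids = tokenizer(rejected, return_tensors="pt", truncation=True)

r_chosen = reward_model(**chosen_ids).logits.squeeze(-1)
r_rejected = reward_model(**rejected_ids).logits.squeeze(-1)

# Push the reward of the human-preferred response above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"pairwise loss: {loss.item():.3f}")
```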
- Reinforcement Learning with Human Feedback (RLHF)
- Objective: Optimize responses using reinforcement learning based on the reward model.
- Key Techniques:
- Proximal Policy Optimization (PPO): Stable and efficient RL algorithm.
- Training Loop:
- Generate a response for a prompt.
- Evaluate the response using the reward model.
- Update model parameters with PPO, balancing:
- Exploration: Discovering better responses.
- Exploitation: Refining known high-reward responses.
- KL Regularization: Penalizes excessive divergence from the pre-trained policy to retain general knowledge.
- Outcome: A model aligned with user intent and capable of producing helpful, safe, and relevant responses.
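A heavily simplified sketch of the RLHF loop above: generate a response from the policy, score it (the reward-model call is stubbed here), subtract a KL penalty against the frozen reference model, and take a plain policy-gradient step. Real systems use PPO's clipped surrogate objective (e.g. via TRL); this only illustrates the shape of the training loop, and `beta`, `gpt2`, and the stubbed reward are assumptions.

```python
# Generate -> score with reward model -> KL-penalized reward -> policy update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")             # trainable policy
ref_policy = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # frozen reference
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
beta = 0.1  # KL penalty coefficient (illustrative value)

prompt_ids = tokenizer("Write a friendly greeting:", return_tensors="pt").input_ids

# 1) Generate a response from the current policy.
gen = policy.generate(prompt_ids, do_sample=True, max_new_tokens=20,
                      pad_token_id=tokenizer.eos_token_id)

def token_logprobs(model, full_ids, n_prompt):
    """Per-token log-probs of the response tokens under `model`."""
    logits = model(full_ids).logits[:, :-1, :]
    logp = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    lp = torch.gather(logp, 2, targets.unsqueeze(-1)).squeeze(-1)
    return lp[:, n_prompt - 1:]  # keep only response positions

logp_policy = token_logprobs(policy, gen, prompt_ids.shape[1])
with torch.no_grad():
    logp_ref = token_logprobs(ref_policy, gen, prompt_ids.shape[1])

# 2) Reward: reward-model score (stubbed) minus KL divergence from the reference.
reward_model_score = torch.tensor(1.0)  # placeholder for a trained reward model
kl = (logp_policy - logp_ref).sum()
reward = reward_model_score - beta * kl.detach()

# 3) Policy-gradient step (REINFORCE-style stand-in for PPO's clipped update).
loss = -(reward * logp_policy.sum())
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"reward: {reward.item():.3f}, kl: {kl.item():.3f}")
```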
Comparison of LLM tuning strategies
- Full Fine-Tuning, PEFT, Prompt Engineering, or RAG? (deci.ai)
- Domain specific generative AI: pre-training, fine-tuning & RAG — Elastic Search Labs
RLHF
- Illustrating Reinforcement Learning from Human Feedback (RLHF) (huggingface.co)
- What is RLHF? | IBM
- Reinforcement Learning from Human Feedback (RLHF) | Niklas Heidloff
- RLHF - Hugging Face Deep RL Course
Code
- #CODE Axolotl
- Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures
- #CODE Unsloth
- #CODE Torchtune